Thera Bank recently saw a steep decline in its number of credit card users. Credit cards are a good source of income for banks because of the various fees they generate, so customers leaving the credit card service means lost revenue. The bank therefore wants to analyze its customer data, identify which customers are likely to leave the credit card service, and understand the reasons so that it can improve in those areas.
As a data scientist at Thera Bank, I need to build a classification model that identifies customers with a high probability of renouncing their credit cards, and provide recommendations to help the bank improve its services.
The objectives are to explore and visualize the data, build an optimized classification model that predicts whether a customer will renounce the credit card service, and generate insights and recommendations for the bank.
Each record in the database represents a customer's information. A detailed data dictionary can be found below.
Data Dictionary
# import relevant libraries
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
# load the data
data = pd.read_csv("BankChurners.csv")
# check a sample of the data to make sure it came in correctly
data.sample(n=10, random_state=1)
|   | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6498 | 712389108 | Existing Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 2 | 2570.000 | 2107 | 463.000 | 0.651 | 4058 | 83 | 0.766 | 0.820 |
| 9013 | 718388733 | Existing Customer | 38 | F | 1 | College | NaN | Less than $40K | Blue | 32 | 2 | 3 | 3 | 2609.000 | 1259 | 1350.000 | 0.871 | 8677 | 96 | 0.627 | 0.483 |
| 2053 | 710109633 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 31 | 6 | 3 | 2 | 9871.000 | 1061 | 8810.000 | 0.545 | 1683 | 34 | 0.478 | 0.107 |
| 3211 | 717331758 | Existing Customer | 44 | M | 4 | Graduate | Married | $120K + | Blue | 32 | 6 | 3 | 4 | 34516.000 | 2517 | 31999.000 | 0.765 | 4228 | 83 | 0.596 | 0.073 |
| 5559 | 709460883 | Attrited Customer | 38 | F | 2 | Doctorate | Married | Less than $40K | Blue | 28 | 5 | 2 | 4 | 1614.000 | 0 | 1614.000 | 0.609 | 2437 | 46 | 0.438 | 0.000 |
| 6106 | 789105183 | Existing Customer | 54 | M | 3 | Post-Graduate | Single | $80K - $120K | Silver | 42 | 3 | 1 | 2 | 34516.000 | 2488 | 32028.000 | 0.552 | 4401 | 87 | 0.776 | 0.072 |
| 4150 | 771342183 | Attrited Customer | 53 | F | 3 | Graduate | Single | $40K - $60K | Blue | 40 | 6 | 3 | 2 | 1625.000 | 0 | 1625.000 | 0.689 | 2314 | 43 | 0.433 | 0.000 |
| 2205 | 708174708 | Existing Customer | 38 | M | 4 | Graduate | Married | $40K - $60K | Blue | 27 | 6 | 2 | 4 | 5535.000 | 1276 | 4259.000 | 0.636 | 1764 | 38 | 0.900 | 0.231 |
| 4145 | 718076733 | Existing Customer | 43 | M | 1 | Graduate | Single | $60K - $80K | Silver | 31 | 4 | 3 | 3 | 25824.000 | 1170 | 24654.000 | 0.684 | 3101 | 73 | 0.780 | 0.045 |
| 5324 | 821889858 | Attrited Customer | 50 | F | 1 | Doctorate | Single | abc | Blue | 46 | 6 | 4 | 3 | 1970.000 | 1477 | 493.000 | 0.662 | 2493 | 44 | 0.571 | 0.750 |
# check the shape
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data.")
There are 10127 rows and 21 columns in the data.
# check that the ID column is unique
data.CLIENTNUM.nunique()
10127
# checking for duplicate values
df = data.copy()
df = df.drop("CLIENTNUM", axis=1)
print(f"There are {df.duplicated().sum()} duplicated rows in the data.")
There are 0 duplicated rows in the data.
# check datatypes of the columns and which columns have null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           10127 non-null  object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
# check which columns have null values
df.isna().sum()[df.isna().sum() > 0]
Education_Level    1519
Marital_Status      749
dtype: int64
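Education_Level and Marital_Status are the only columns with missing values; they are why SimpleImputer was imported above. One way they could later be filled is with the most frequent category per column (a sketch on toy data, not necessarily the notebook's final imputation choice):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the two columns with missing values
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "Graduate", "College"],
        "Marital_Status": ["Married", "Single", np.nan, "Married"],
    }
)

# Fill each categorical column with its most frequent value
imputer = SimpleImputer(strategy="most_frequent")
toy[["Education_Level", "Marital_Status"]] = imputer.fit_transform(toy)
print(toy.isna().sum().sum())  # 0
```

In a full pipeline, this imputer would be fit on the training split only to avoid leakage.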
# check the unique values for the categorical variables
cat_cols = list(df.select_dtypes(include="object").columns)
for i in cat_cols:
print(df[i].value_counts(normalize=True))
print("-" * 50)
Existing Customer    0.839
Attrited Customer    0.161
Name: Attrition_Flag, dtype: float64
--------------------------------------------------
F    0.529
M    0.471
Name: Gender, dtype: float64
--------------------------------------------------
Graduate         0.363
High School      0.234
Uneducated       0.173
College          0.118
Post-Graduate    0.060
Doctorate        0.052
Name: Education_Level, dtype: float64
--------------------------------------------------
Married     0.500
Single      0.420
Divorced    0.080
Name: Marital_Status, dtype: float64
--------------------------------------------------
Less than $40K    0.352
$40K - $60K       0.177
$80K - $120K      0.152
$60K - $80K       0.138
abc               0.110
$120K +           0.072
Name: Income_Category, dtype: float64
--------------------------------------------------
Blue        0.932
Silver      0.055
Gold        0.011
Platinum    0.002
Name: Card_Category, dtype: float64
--------------------------------------------------
# replace values
df["Income_Category"] = df["Income_Category"].replace("abc", "Unknown")
df.Income_Category.value_counts()
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64
# convert object types to category type
df[cat_cols] = df[cat_cols].astype("category")
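Besides enabling categorical-aware operations, the category dtype also saves memory when a column has few distinct labels, since each string is stored once and rows hold small integer codes; a small illustration with made-up data:

```python
import pandas as pd

# Toy column with few distinct labels, like Card_Category
s_obj = pd.Series(["Blue"] * 9000 + ["Silver"] * 1000)  # object dtype
s_cat = s_obj.astype("category")                        # category dtype

# category stores each label once plus one small integer code per row
print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```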
# look at the statistical summary of the data
df.describe(include="all").T
|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.000 | NaN | NaN | NaN | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.000 | NaN | NaN | NaN | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.000 | NaN | NaN | NaN | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | NaN | NaN | NaN | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | NaN | NaN | NaN | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | NaN | NaN | NaN | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | NaN | NaN | NaN | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | NaN | NaN | NaN | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | NaN | NaN | NaN | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | NaN | NaN | NaN | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | NaN | NaN | NaN | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | NaN | NaN | NaN | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Significant observations:
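One observation worth verifying from the summary: Avg_Open_To_Buy looks like the unused portion of the credit limit, i.e. Credit_Limit minus Total_Revolving_Bal (the means, 8631.95 - 1162.81 ≈ 7469.14, are consistent with this). A sketch of the check using rows copied from the data sample shown earlier:

```python
import pandas as pd

# Three rows copied from the data sample above
toy = pd.DataFrame(
    {
        "Credit_Limit": [2570.0, 9871.0, 34516.0],
        "Total_Revolving_Bal": [2107, 1061, 2517],
        "Avg_Open_To_Buy": [463.0, 8810.0, 31999.0],
    }
)

# Open-to-buy equals the credit limit minus the revolving balance
diff = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]
print((diff == toy["Avg_Open_To_Buy"]).all())  # True
```

If this identity holds across the full data, one of the three columns is redundant and a candidate for dropping before modeling.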
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default True)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    # histogram with the optional density curve
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
## plot histogram and boxplot for the numerical features
num_cols = list(df.select_dtypes(include=["float", "int"]).columns)
for i in num_cols:
print(i)
histogram_boxplot(df, i)
plt.show()
print(
" ****************************************************************** "
) ## To create a separator
[Histogram and boxplot displayed for each numerical feature: Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio]
## Barplot for the categorical features
for i in cat_cols:
print(i)
labeled_barplot(df, i, perc=True)
plt.show()
print(
" ****************************************************************** "
) ## To create a separator
[Labeled barplot displayed for each categorical feature: Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category]
We will check the variables that showed extreme outliers in their univariate distributions.
# check higher end outliers of Credit Limit
df[df.Credit_Limit > 30000].Income_Category.value_counts()
$80K - $120K      285
$120K +           239
$60K - $80K        77
Unknown            66
$40K - $60K         0
Less than $40K      0
Name: Income_Category, dtype: int64
# Treat Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 outliers that are greater than 4*IQR from the median
# look at Total_Amt_Chng_Q4_Q1 first
quartiles = np.quantile(
df["Total_Amt_Chng_Q4_Q1"][df["Total_Amt_Chng_Q4_Q1"].notnull()], [0.25, 0.75]
)
tot_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_tot = df.loc[
np.abs(df["Total_Amt_Chng_Q4_Q1"] - df["Total_Amt_Chng_Q4_Q1"].median()) > tot_4iqr,
"Total_Amt_Chng_Q4_Q1",
]
print(outlier_tot.sort_values(ascending=False).count() / df.shape[0] * 100, "%")
df.loc[outlier_tot.sort_values().index]
0.543102597017873 %
|   | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | Existing Customer | 64 | M | 1 | Graduate | Married | Less than $40K | Blue | 52 | 6 | 4 | 3 | 1709.000 | 895 | 814.000 | 1.656 | 1673 | 32 | 0.882 | 0.524 |
| 1883 | Existing Customer | 37 | M | 2 | College | Married | $80K - $120K | Blue | 17 | 5 | 3 | 2 | 4631.000 | 1991 | 2640.000 | 1.669 | 2864 | 37 | 0.947 | 0.430 |
| 113 | Existing Customer | 54 | F | 0 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 2 | 2 | 1494.000 | 706 | 788.000 | 1.674 | 1305 | 24 | 3.000 | 0.473 |
| 3270 | Existing Customer | 49 | M | 3 | High School | NaN | $60K - $80K | Blue | 36 | 3 | 2 | 2 | 9551.000 | 1833 | 7718.000 | 1.675 | 3213 | 52 | 1.476 | 0.192 |
| 1570 | Existing Customer | 49 | M | 2 | NaN | Single | $60K - $80K | Blue | 38 | 4 | 1 | 2 | 2461.000 | 1586 | 875.000 | 1.676 | 1729 | 35 | 0.750 | 0.644 |
| 137 | Existing Customer | 45 | M | 4 | College | Divorced | $60K - $80K | Blue | 40 | 5 | 1 | 0 | 10408.000 | 1186 | 9222.000 | 1.689 | 2560 | 42 | 1.211 | 0.114 |
| 89 | Existing Customer | 57 | M | 2 | NaN | Married | $120K + | Blue | 45 | 5 | 3 | 3 | 5266.000 | 0 | 5266.000 | 1.702 | 1516 | 29 | 1.636 | 0.000 |
| 94 | Existing Customer | 45 | F | 3 | NaN | Married | Unknown | Blue | 28 | 5 | 1 | 2 | 2535.000 | 2440 | 95.000 | 1.705 | 1312 | 20 | 1.222 | 0.963 |
| 1689 | Existing Customer | 34 | M | 0 | Graduate | Married | $60K - $80K | Blue | 26 | 4 | 3 | 3 | 5175.000 | 977 | 4198.000 | 1.705 | 2405 | 49 | 0.885 | 0.189 |
| 15 | Existing Customer | 44 | M | 4 | NaN | NaN | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.000 | 972 | 3262.000 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 336 | Existing Customer | 56 | F | 1 | Graduate | Married | Less than $40K | Blue | 38 | 4 | 3 | 3 | 2578.000 | 2462 | 116.000 | 1.707 | 1378 | 29 | 0.812 | 0.955 |
| 16 | Existing Customer | 48 | M | 4 | Post-Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 3 | 30367.000 | 2362 | 28005.000 | 1.708 | 1671 | 27 | 0.929 | 0.078 |
| 68 | Existing Customer | 49 | M | 2 | Graduate | Married | $60K - $80K | Blue | 32 | 2 | 2 | 2 | 1687.000 | 1107 | 580.000 | 1.715 | 1670 | 17 | 2.400 | 0.656 |
| 36 | Existing Customer | 55 | F | 3 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 2 | 3 | 3035.000 | 2298 | 737.000 | 1.724 | 1877 | 37 | 1.176 | 0.757 |
| 32 | Existing Customer | 41 | M | 4 | Graduate | Married | $60K - $80K | Blue | 36 | 4 | 1 | 2 | 8923.000 | 2517 | 6406.000 | 1.726 | 1589 | 24 | 1.667 | 0.282 |
| 231 | Existing Customer | 57 | M | 2 | NaN | Married | $80K - $120K | Blue | 46 | 2 | 3 | 0 | 18871.000 | 1740 | 17131.000 | 1.727 | 1516 | 21 | 2.000 | 0.092 |
| 2565 | Existing Customer | 39 | M | 3 | Graduate | Married | $120K + | Blue | 36 | 3 | 3 | 2 | 32964.000 | 2231 | 30733.000 | 1.731 | 3094 | 45 | 1.647 | 0.068 |
| 2337 | Existing Customer | 50 | F | 2 | Graduate | Divorced | $40K - $60K | Blue | 40 | 6 | 2 | 5 | 8307.000 | 2517 | 5790.000 | 1.743 | 2293 | 36 | 0.800 | 0.303 |
| 1369 | Existing Customer | 36 | F | 2 | Uneducated | Married | Less than $40K | Blue | 36 | 4 | 2 | 2 | 4066.000 | 1639 | 2427.000 | 1.749 | 3040 | 56 | 0.931 | 0.403 |
| 33 | Existing Customer | 53 | F | 2 | College | Married | Less than $40K | Blue | 38 | 5 | 2 | 3 | 2650.000 | 1490 | 1160.000 | 1.750 | 1411 | 28 | 1.000 | 0.562 |
| 190 | Existing Customer | 57 | M | 1 | Graduate | Married | $80K - $120K | Blue | 47 | 5 | 3 | 1 | 14612.000 | 1976 | 12636.000 | 1.768 | 1827 | 24 | 3.000 | 0.135 |
| 1718 | Existing Customer | 42 | F | 4 | Post-Graduate | Single | Less than $40K | Blue | 36 | 6 | 2 | 3 | 1438.300 | 674 | 764.300 | 1.769 | 2451 | 55 | 1.292 | 0.469 |
| 1455 | Existing Customer | 39 | F | 2 | Doctorate | Married | Unknown | Blue | 36 | 5 | 2 | 4 | 8058.000 | 791 | 7267.000 | 1.787 | 2742 | 42 | 2.000 | 0.098 |
| 180 | Existing Customer | 45 | M | 2 | Uneducated | Married | $40K - $60K | Blue | 34 | 3 | 2 | 1 | 5771.000 | 2248 | 3523.000 | 1.791 | 1387 | 18 | 0.800 | 0.390 |
| 1486 | Existing Customer | 39 | M | 2 | Graduate | Married | $40K - $60K | Blue | 31 | 5 | 3 | 2 | 8687.000 | 1146 | 7541.000 | 1.800 | 2279 | 33 | 1.357 | 0.132 |
| 115 | Existing Customer | 49 | M | 1 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 18886.000 | 895 | 17991.000 | 1.826 | 1235 | 18 | 1.571 | 0.047 |
| 18 | Existing Customer | 61 | M | 1 | High School | Married | $40K - $60K | Blue | 56 | 2 | 2 | 3 | 3193.000 | 2517 | 676.000 | 1.831 | 1336 | 30 | 1.143 | 0.788 |
| 295 | Existing Customer | 60 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 1 | 3 | 3281.000 | 837 | 2444.000 | 1.859 | 1424 | 29 | 1.417 | 0.255 |
| 855 | Existing Customer | 39 | F | 2 | Graduate | Married | Unknown | Blue | 31 | 4 | 2 | 3 | 1438.300 | 997 | 441.300 | 1.867 | 2583 | 47 | 0.958 | 0.693 |
| 117 | Existing Customer | 50 | M | 3 | High School | Single | $80K - $120K | Blue | 39 | 4 | 1 | 4 | 9964.000 | 1559 | 8405.000 | 1.873 | 1626 | 25 | 0.786 | 0.156 |
| 1176 | Existing Customer | 34 | M | 2 | College | Married | $80K - $120K | Blue | 22 | 4 | 2 | 4 | 1631.000 | 0 | 1631.000 | 1.893 | 2962 | 57 | 1.111 | 0.000 |
| 869 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 35 | 4 | 3 | 2 | 7410.000 | 2517 | 4893.000 | 1.924 | 2398 | 37 | 1.176 | 0.340 |
| 88 | Existing Customer | 44 | M | 3 | High School | Single | $60K - $80K | Blue | 31 | 4 | 3 | 1 | 12756.000 | 837 | 11919.000 | 1.932 | 1413 | 14 | 1.800 | 0.066 |
| 6 | Existing Customer | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 142 | Existing Customer | 54 | M | 4 | Graduate | Married | $80K - $120K | Blue | 34 | 2 | 3 | 2 | 14926.000 | 2517 | 12409.000 | 1.996 | 1576 | 25 | 1.500 | 0.169 |
| 431 | Existing Customer | 47 | F | 4 | NaN | Divorced | $40K - $60K | Blue | 34 | 6 | 1 | 2 | 3502.000 | 1851 | 1651.000 | 2.023 | 1814 | 31 | 0.722 | 0.529 |
| 1873 | Existing Customer | 38 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 36 | 5 | 2 | 3 | 3421.000 | 2308 | 1113.000 | 2.037 | 2269 | 39 | 1.053 | 0.675 |
| 1085 | Existing Customer | 45 | F | 3 | Graduate | Single | Unknown | Blue | 36 | 3 | 3 | 4 | 11189.000 | 2517 | 8672.000 | 2.041 | 2959 | 58 | 1.231 | 0.225 |
| 177 | Existing Customer | 67 | F | 1 | Graduate | Married | Less than $40K | Blue | 56 | 4 | 3 | 2 | 3006.000 | 2517 | 489.000 | 2.053 | 1661 | 32 | 1.000 | 0.837 |
| 1219 | Existing Customer | 38 | F | 4 | Graduate | Married | Unknown | Blue | 28 | 4 | 1 | 2 | 6861.000 | 1598 | 5263.000 | 2.103 | 2228 | 39 | 0.950 | 0.233 |
| 154 | Existing Customer | 53 | F | 1 | College | Married | Less than $40K | Blue | 47 | 4 | 2 | 3 | 2154.000 | 930 | 1224.000 | 2.121 | 1439 | 26 | 1.364 | 0.432 |
| 284 | Existing Customer | 61 | M | 0 | Graduate | Married | $40K - $60K | Blue | 52 | 3 | 1 | 2 | 2939.000 | 1999 | 940.000 | 2.145 | 2434 | 33 | 1.538 | 0.680 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 841 | Existing Customer | 37 | F | 3 | NaN | Married | Less than $40K | Blue | 25 | 6 | 2 | 1 | 1438.300 | 674 | 764.300 | 2.180 | 1717 | 31 | 0.722 | 0.469 |
| 7 | Existing Customer | 32 | M | 0 | High School | NaN | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.000 | 1396 | 27685.000 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 466 | Existing Customer | 63 | M | 2 | Graduate | Married | $60K - $80K | Blue | 49 | 5 | 2 | 3 | 14035.000 | 2061 | 11974.000 | 2.271 | 1606 | 30 | 1.500 | 0.147 |
| 58 | Existing Customer | 44 | F | 5 | Graduate | Married | Unknown | Blue | 35 | 4 | 1 | 2 | 6273.000 | 978 | 5295.000 | 2.275 | 1359 | 25 | 1.083 | 0.156 |
| 658 | Existing Customer | 46 | M | 4 | Graduate | Married | $60K - $80K | Blue | 35 | 5 | 1 | 2 | 1535.000 | 700 | 835.000 | 2.282 | 1848 | 25 | 1.083 | 0.456 |
| 46 | Existing Customer | 56 | M | 2 | Doctorate | Married | $60K - $80K | Blue | 45 | 6 | 2 | 0 | 2283.000 | 1430 | 853.000 | 2.316 | 1741 | 27 | 0.588 | 0.626 |
| 47 | Existing Customer | 59 | M | 1 | Doctorate | Married | $40K - $60K | Blue | 52 | 3 | 2 | 2 | 2548.000 | 2020 | 528.000 | 2.357 | 1719 | 27 | 1.700 | 0.793 |
| 219 | Existing Customer | 44 | F | 3 | Uneducated | Divorced | Less than $40K | Silver | 38 | 4 | 1 | 3 | 11127.000 | 1835 | 9292.000 | 2.368 | 1546 | 25 | 1.273 | 0.165 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | Unknown | Blue | 53 | 6 | 2 | 3 | 14434.000 | 1927 | 12507.000 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
| 8 | Existing Customer | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.000 | 2517 | 19835.000 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.000 | 0 | 11751.000 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
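The same median ± 4·IQR rule can be wrapped in a small reusable helper (the function name is ours) so the check is easy to repeat for Total_Ct_Chng_Q4_Q1:

```python
import pandas as pd

def iqr_outliers(series, k=4):
    """Return the values farther than k * IQR from the median."""
    q1, q3 = series.quantile([0.25, 0.75])
    threshold = k * (q3 - q1)
    return series[(series - series.median()).abs() > threshold]

# Toy example: one extreme value among otherwise similar ones
s = pd.Series([0.6, 0.7, 0.72, 0.75, 0.8, 3.4])
print(iqr_outliers(s).tolist())  # [3.4]
```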
Explore relationships between variables.
# correlation plot
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# plot numerical features against each other
sns.pairplot(data=df, hue="Attrition_Flag", vars=num_cols, corner=True)
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# plot numerical variables with respect to target
sns.set(font_scale=1)
for i in num_cols:
distribution_plot_wrt_target(df, i, "Attrition_Flag")
plt.show()
print("*" * 100)
[Distribution and boxplot panels with respect to the target displayed for each of the 14 numerical features]
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # single legend outside the plot
plt.show()
# plot categorical variables with respect to target
othercols = cat_cols.copy()
othercols.remove("Attrition_Flag")
for i in othercols:
print(i)
stacked_barplot(df, i, "Attrition_Flag")
plt.show()
print(
" ****************************************************************** "
) ## To create a separator
Gender
Attrition_Flag  Attrited Customer  Existing Customer    All
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------
Education_Level
Attrition_Flag   Attrited Customer  Existing Customer   All
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------
Marital_Status
Attrition_Flag  Attrited Customer  Existing Customer   All
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------
Income_Category
Attrition_Flag   Attrited Customer  Existing Customer    All
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
Unknown                        187                925   1112
$120K +                        126                601    727
------------------------------------------------------------
Card_Category
Attrition_Flag  Attrited Customer  Existing Customer    All
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------
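The visual differences across categories can also be quantified with a chi-square test of independence on the same crosstabs; a sketch using scipy (which this notebook does not otherwise import), with the Gender counts from above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table taken from the Gender crosstab printed above
table = pd.DataFrame(
    {"Attrited Customer": [930, 697], "Existing Customer": [4428, 4072]},
    index=["F", "M"],
)

# Test whether attrition is independent of gender
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value would suggest the attrition rate genuinely differs across the category levels rather than by chance.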
# see how Marital_Status interacts with the variables that showed no significant difference with respect to the target
plt.figure(figsize=(20, 20))
var_cols = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Months_Inactive_12_mon",
"Avg_Open_To_Buy",
]
for i, variable in enumerate(var_cols):
plt.subplot(7, 2, i + 1)
    sns.boxplot(
        data=df, x="Marital_Status", y=variable, hue="Attrition_Flag", palette="PuBu"
    )
plt.tight_layout()
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.title(variable)
plt.show()
# see how Income_Category varies with the numeric variables that showed little
# separation with respect to the target
plt.figure(figsize=(20, 20))
var_cols = [
    "Customer_Age",
    "Dependent_count",
    "Months_on_book",
    "Months_Inactive_12_mon",
    "Avg_Open_To_Buy",
]
for i, variable in enumerate(var_cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(
        x="Income_Category", y=variable, hue="Attrition_Flag", data=df, palette="PuBu"
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.tight_layout()
plt.show()
We will apply transformations to the highly skewed features so that they are more normally distributed. This may also reduce the number of outliers.
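As an aside, a list like `skew_cols` can be built programmatically by thresholding the sample skewness of the numeric columns. This is an illustrative sketch on synthetic data (the threshold of 1.0 is an arbitrary choice, not taken from the notebook):

```python
# Illustrative sketch: identify right-skewed columns by thresholding skewness.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=1000),       # skewness near 0
    "right_skewed": rng.exponential(size=1000),  # skewness near 2
})
# keep columns whose sample skewness exceeds the chosen threshold
skew_cols_demo = [c for c in demo.columns if demo[c].skew() > 1.0]
print(skew_cols_demo)
```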
# check the min values of the columns: a minimum of 0 rules out a plain log transformation, since log(0) is undefined
df[skew_cols].min()
Credit_Limit             1438.300
Avg_Open_To_Buy             3.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt           510.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
# use sqrt transformation to get highly right-skewed variables more normally distributed
# create copy of data
df1 = df.copy()
# identify skewed variables that work better with sqrt transformations
sqrt_cols = [
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# create transformed features and plot them
for col in sqrt_cols:
    df1[col + "_sqrt"] = np.sqrt(df1[col])
    histogram_boxplot(df1, col + "_sqrt")
# dropping the original columns
df1.drop(sqrt_cols, axis=1, inplace=True)
# use log transformation to get highly right-skewed variables more normally distributed
# identify skewed variables that work better with log transformations
log_cols = [
"Credit_Limit",
"Total_Trans_Amt",
]
# create transformed features and plot them
for col in log_cols:
    df1[col + "_log"] = np.log(df1[col])
    histogram_boxplot(df1, col + "_log")
# dropping the original columns
df1.drop(log_cols, axis=1, inplace=True)
# calculate correlations of transformed features to see if we need to drop any highly correlated features
plt.figure(figsize=(15, 7))
sns.heatmap(df1.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
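Reading pairs off a large heatmap is error-prone, so a programmatic cross-check can help. This is a hedged sketch on synthetic data (the 0.9 cutoff is an illustrative choice): it lists feature pairs whose absolute correlation exceeds the cutoff, using the upper triangle so each pair appears once.

```python
# Sketch: list feature pairs with |correlation| above a cutoff.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "a": x,
    "b": x * 2 + rng.normal(scale=0.01, size=500),  # near-duplicate of "a"
    "c": rng.normal(size=500),                       # independent
})
corr = demo.corr().abs()
# mask out the diagonal and lower triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if upper.loc[r, c] > 0.9]
print(high_pairs)
```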
# create copy of data
df2 = df1.copy()
# drop highly correlated columns
df2.drop(columns=["Avg_Open_To_Buy_sqrt", "Total_Trans_Amt_log"], inplace=True)
# separating target variable from other variables
X = df2.drop(columns="Attrition_Flag")
y = df2["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 17) (2026, 17) (2026, 17)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026
print("Split of 0 and 1 in training data:\n", y_train.value_counts(normalize=True))
print("Split of 0 and 1 in validation data:\n", y_val.value_counts(normalize=True))
print("Split of 0 and 1 in test data:\n", y_test.value_counts(normalize=True))
Split of 0 and 1 in training data:
0    0.839
1    0.161
Name: Attrition_Flag, dtype: float64
Split of 0 and 1 in validation data:
0    0.839
1    0.161
Name: Attrition_Flag, dtype: float64
Split of 0 and 1 in test data:
0    0.840
1    0.160
Name: Attrition_Flag, dtype: float64
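The two split fractions are not a typo: holding out 20% for test and then 25% of the remaining 80% for validation yields a 60/20/20 partition. A quick sketch on a dummy array confirms the arithmetic:

```python
# Sketch: the two-stage split (20% test, then 25% of the remainder for
# validation) produces a 60/20/20 partition.
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)
print(len(X_tr), len(X_va), len(X_te))  # 60 20 20
```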
# recall which features are missing values
df2.isna().sum()[df2.isna().sum() > 0]
Education_Level    1519
Marital_Status      749
dtype: int64
# creating a list of categorical variables
categorical_features = X_train.select_dtypes(include=["category"]).columns.tolist()
# creating a transformer for categorical variables, which will first apply simple imputation and then one-hot encoding
# one-hot encoding drops the first level of each variable, since logistic regression is affected by multicollinearity
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(drop="first")),
]
)
# creating a list of numerical variables
numerical_features = X_train.select_dtypes(
include=["int64", "float64"]
).columns.tolist()
# creating a transformer for numerical variables, which will apply standard scaling on the numerical variables
numeric_transformer = Pipeline(
steps=[
("imputer_num", SimpleImputer(strategy="median")),
("standard scaler", StandardScaler()),
]
)
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# fit the pipeline to the training data
preprocessor.fit(X_train)
# apply the pipeline to the training and test data
X_train_t = preprocessor.transform(X_train)
X_val_t = preprocessor.transform(X_val)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_val_t.shape, X_test_t.shape)
(6075, 28) (2026, 28) (2026, 28)
Recall should be maximized: the higher the recall, the fewer churners the model misses as false negatives.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with counts and percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
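The emphasis on recall can be seen on a toy imbalanced example: with roughly 16% churners, a degenerate model that predicts "no churn" for everyone still looks good on accuracy while catching zero churners.

```python
# Toy illustration: on imbalanced labels, accuracy can be high while recall is zero.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 90 + [1] * 10   # 10% positives (churners)
y_pred = [0] * 100             # degenerate "never churn" predictor
print(accuracy_score(y_true, y_pred))  # 0.9
print(recall_score(y_true, y_pred))    # 0.0 -- every churner is a false negative
```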
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("logr", LogisticRegression(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
names = [] # Empty list to store name of the models
score_train = []
score_val = []
# loop through all models to get the training performance
for name, model in models:
    model.fit(X_train_t, y_train)
    scores = model_performance_classification_sklearn(model, X_train_t, y_train)
    score_train.append(scores.T)
    names.append(name)
# create dataframe of results
models_train_basic = pd.concat(score_train, axis=1)
models_train_basic.columns = names
# loop through all models to get the validation performance
# (the models were already fit on the training data above)
for name, model in models:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val.append(scores.T)
# create dataframe of results
models_val_basic = pd.concat(score_val, axis=1)
models_val_basic.columns = names
print("Training Performance Comparison:")
models_train_basic
Training Performance Comparison:
| logr | dtree | Bagging | GBM | Adaboost | Xgboost | |
|---|---|---|---|---|---|---|
| Accuracy | 0.897 | 1.000 | 0.993 | 0.943 | 0.928 | 1.000 |
| Recall | 0.519 | 1.000 | 0.960 | 0.742 | 0.719 | 0.999 |
| Precision | 0.764 | 1.000 | 0.996 | 0.887 | 0.813 | 1.000 |
| F1 | 0.618 | 1.000 | 0.978 | 0.808 | 0.763 | 0.999 |
print("\n" "Validation Performance Comparison:" "\n")
models_val_basic
Validation Performance Comparison:
| logr | dtree | Bagging | GBM | Adaboost | Xgboost | |
|---|---|---|---|---|---|---|
| Accuracy | 0.907 | 0.888 | 0.923 | 0.939 | 0.929 | 0.938 |
| Recall | 0.580 | 0.644 | 0.656 | 0.733 | 0.730 | 0.758 |
| Precision | 0.784 | 0.656 | 0.829 | 0.869 | 0.810 | 0.843 |
| F1 | 0.667 | 0.650 | 0.733 | 0.795 | 0.768 | 0.798 |
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
from imblearn.over_sampling import SMOTE  # may already be imported earlier in the notebook

# Synthetic Minority Oversampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train_t, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of X_train: {}".format(X_train_over.shape))
print("After Oversampling, the shape of y_train: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099
After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099
After Oversampling, the shape of X_train: (10198, 28)
After Oversampling, the shape of y_train: (10198,)
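For intuition, SMOTE's core step can be sketched without imblearn: each synthetic minority point is a random interpolation between a minority sample and one of its k nearest minority neighbors. This is a conceptual sketch, not the imblearn implementation:

```python
# Conceptual sketch of SMOTE's interpolation step (not imblearn's code).
import numpy as np

rng = np.random.default_rng(1)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])  # toy minority class
i, j = 0, 1                       # a minority sample and a chosen neighbor
gap = rng.uniform(0, 1)           # random interpolation factor in (0, 1)
synthetic = minority[i] + gap * (minority[j] - minority[i])
# the synthetic point lies on the segment between the two originals
```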
models_over = [] # Empty list to store all the models
# Appending models into the list
models_over.append(("logr_over", LogisticRegression(random_state=1)))
models_over.append(("dtree_over", DecisionTreeClassifier(random_state=1)))
models_over.append(("Bagging_over", BaggingClassifier(random_state=1)))
models_over.append(("GBM_over", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost_over", AdaBoostClassifier(random_state=1)))
models_over.append(
("Xgboost_over", XGBClassifier(random_state=1, eval_metric="logloss"))
)
names_over = [] # Empty list to store name of the models
score_train_over = []
score_val_over = []
# loop through all models to get the training performance
for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    scores = model_performance_classification_sklearn(model, X_train_over, y_train_over)
    score_train_over.append(scores.T)
    names_over.append(name)
# create dataframe of results
models_train_over = pd.concat(score_train_over, axis=1)
models_train_over.columns = names_over
# loop through all models to get the validation performance
# (the models were already fit on the oversampled training data above)
for name, model in models_over:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val_over.append(scores.T)
# create dataframe of results
models_val_over = pd.concat(score_val_over, axis=1)
models_val_over.columns = names_over
print("Training Performance Comparison:")
models_train_over
Training Performance Comparison:
| logr_over | dtree_over | Bagging_over | GBM_over | Adaboost_over | Xgboost_over | |
|---|---|---|---|---|---|---|
| Accuracy | 0.842 | 1.000 | 0.997 | 0.961 | 0.933 | 1.000 |
| Recall | 0.845 | 1.000 | 0.997 | 0.960 | 0.940 | 1.000 |
| Precision | 0.840 | 1.000 | 0.998 | 0.961 | 0.926 | 1.000 |
| F1 | 0.842 | 1.000 | 0.997 | 0.961 | 0.933 | 1.000 |
print("\n" "Validation Performance Comparison:" "\n")
models_val_over
Validation Performance Comparison:
| logr_over | dtree_over | Bagging_over | GBM_over | Adaboost_over | Xgboost_over | |
|---|---|---|---|---|---|---|
| Accuracy | 0.834 | 0.892 | 0.919 | 0.928 | 0.903 | 0.940 |
| Recall | 0.831 | 0.730 | 0.730 | 0.794 | 0.788 | 0.782 |
| Precision | 0.490 | 0.647 | 0.758 | 0.766 | 0.669 | 0.833 |
| F1 | 0.617 | 0.686 | 0.744 | 0.780 | 0.724 | 0.807 |
from imblearn.under_sampling import RandomUnderSampler  # may already be imported earlier in the notebook

rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train_t, y_train)
print("Before Undersampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Undersampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Undersampling, the shape of X_train: {}".format(X_train_un.shape))
print("After Undersampling, the shape of y_train: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Yes': 976
Before Undersampling, counts of label 'No': 5099
After Undersampling, counts of label 'Yes': 976
After Undersampling, counts of label 'No': 976
After Undersampling, the shape of X_train: (1952, 28)
After Undersampling, the shape of y_train: (1952,)
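Random undersampling is simpler than SMOTE: keep every minority row and a uniform random subset of majority rows of the same size. A minimal numpy sketch of the idea (illustrative, not imblearn's code):

```python
# Sketch of random undersampling: balance classes by subsampling the majority.
import numpy as np

rng = np.random.default_rng(1)
y_demo = np.array([0] * 50 + [1] * 10)        # imbalanced toy labels
minority_idx = np.flatnonzero(y_demo == 1)    # keep all minority rows
majority_idx = rng.choice(np.flatnonzero(y_demo == 0),
                          size=len(minority_idx), replace=False)
kept = np.sort(np.concatenate([majority_idx, minority_idx]))
print(len(kept))  # 20 rows, 10 per class
```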
models_un = [] # Empty list to store all the models
# Appending models into the list
models_un.append(("logr_un", LogisticRegression(random_state=1)))
models_un.append(("dtree_un", DecisionTreeClassifier(random_state=1)))
models_un.append(("Bagging_un", BaggingClassifier(random_state=1)))
models_un.append(("GBM_un", GradientBoostingClassifier(random_state=1)))
models_un.append(("Adaboost_un", AdaBoostClassifier(random_state=1)))
models_un.append(("Xgboost_un", XGBClassifier(random_state=1, eval_metric="logloss")))
names_un = [] # Empty list to store name of the models
score_train_un = []
score_val_un = []
# loop through all models to get the training performance
for name, model in models_un:
    model.fit(X_train_un, y_train_un)
    scores = model_performance_classification_sklearn(model, X_train_un, y_train_un)
    score_train_un.append(scores.T)
    names_un.append(name)
# create dataframe of results
models_train_un = pd.concat(score_train_un, axis=1)
models_train_un.columns = names_un
# loop through all models to get the validation performance
# (the models were already fit on the undersampled training data above)
for name, model in models_un:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val_un.append(scores.T)
# create dataframe of results
models_val_un = pd.concat(score_val_un, axis=1)
models_val_un.columns = names_un
print("Training Performance Comparison:")
models_train_un
Training Performance Comparison:
| logr_un | dtree_un | Bagging_un | GBM_un | Adaboost_un | Xgboost_un | |
|---|---|---|---|---|---|---|
| Accuracy | 0.835 | 1.000 | 0.990 | 0.940 | 0.892 | 1.000 |
| Recall | 0.829 | 1.000 | 0.983 | 0.942 | 0.894 | 1.000 |
| Precision | 0.838 | 1.000 | 0.998 | 0.939 | 0.891 | 1.000 |
| F1 | 0.834 | 1.000 | 0.990 | 0.940 | 0.893 | 1.000 |
print("\n" "Validation Performance Comparison:" "\n")
models_val_un
Validation Performance Comparison:
| logr_un | dtree_un | Bagging_un | GBM_un | Adaboost_un | Xgboost_un | |
|---|---|---|---|---|---|---|
| Accuracy | 0.824 | 0.817 | 0.887 | 0.899 | 0.881 | 0.896 |
| Recall | 0.850 | 0.813 | 0.850 | 0.877 | 0.896 | 0.902 |
| Precision | 0.474 | 0.461 | 0.606 | 0.634 | 0.585 | 0.622 |
| F1 | 0.608 | 0.588 | 0.708 | 0.736 | 0.708 | 0.736 |
The top three models, with good performance on the training and validation sets and the least overfitting, were Logistic Regression, AdaBoost, and Gradient Boosting on the undersampled data. We will tune these three models.
# define the model
logr_tuned = LogisticRegression(random_state=1)
# Grid of parameters to choose from
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV  # may already be imported earlier in the notebook

# note: incompatible solver/penalty combinations (e.g. lbfgs with l1) fail to fit
# and are scored as NaN by the search, so they are effectively skipped
param_grid = {
    "solver": ["newton-cg", "lbfgs", "liblinear"],
    "penalty": ["none", "l1", "l2", "elasticnet"],
    "C": loguniform(1e-5, 100),
}
# Run the random search
randomized_cv_logr = RandomizedSearchCV(
estimator=logr_tuned,
param_distributions=param_grid,
n_jobs=-1,
n_iter=100,
scoring="recall",
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv_logr.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv_logr.best_params_, randomized_cv_logr.best_score_
)
)
Best parameters are {'C': 0.003131281159444946, 'penalty': 'l1', 'solver': 'liblinear'} with CV score=0.8924437467294609:
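Why `loguniform` for C: it samples evenly across orders of magnitude, so very small and very large regularization strengths are equally represented, unlike a plain uniform draw which would almost never land below 1. A quick illustrative check:

```python
# Illustrative check: loguniform(1e-5, 100) stays within its bounds and spans
# many orders of magnitude.
from scipy.stats import loguniform

samples = loguniform(1e-5, 100).rvs(size=1000, random_state=1)
print(samples.min() >= 1e-5, samples.max() <= 100)
```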
# Set the clf to the best combination of parameters
logr_tuned = randomized_cv_logr.best_estimator_
# Fit the best algorithm to the data.
logr_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
logr_train_perf = model_performance_classification_sklearn(
logr_tuned, X_train_un, y_train_un
)
print("Training performance:")
logr_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.759 | 0.892 | 0.704 | 0.787 |
# Calculating different metrics on validation set
logr_val_perf = model_performance_classification_sklearn(logr_tuned, X_val_t, y_val)
print("Validation performance:")
logr_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.680 | 0.887 | 0.321 | 0.471 |
# define the model
gb_tuned = GradientBoostingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"init": [
AdaBoostClassifier(random_state=1),
DecisionTreeClassifier(random_state=1),
],
"n_estimators": np.arange(70, 150, 10),
"learning_rate": [0.1, 0.01, 0.05, 0.50, 0.20, 1],
"subsample": [0.3, 0.5, 0.7, 0.8, 1],
"max_features": [0.3, 0.5, 0.7, 0.8, 1],
}
# Run the random search
randomized_cv_gb = RandomizedSearchCV(
estimator=gb_tuned,
param_distributions=parameters,
n_iter=100,
scoring="recall",
cv=5,
random_state=1,
n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv_gb.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv_gb.best_params_, randomized_cv_gb.best_score_
)
)
Best parameters are {'subsample': 1, 'n_estimators': 90, 'max_features': 0.5, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8954683411826269:
# Set the clf to the best combination of parameters
gb_tuned = randomized_cv_gb.best_estimator_
# Fit the best algorithm to the data.
gb_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
gb_train_perf = model_performance_classification_sklearn(
gb_tuned, X_train_un, y_train_un
)
print("Training performance:")
gb_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.964 | 0.964 | 0.963 | 0.964 |
# Calculating different metrics on validation set
gb_val_perf = model_performance_classification_sklearn(gb_tuned, X_val_t, y_val)
print("Validation performance:")
gb_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.901 | 0.877 | 0.640 | 0.740 |
# define the model
ada_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
param_grid = {
"n_estimators": np.arange(10, 150, 10),
"learning_rate": [0.1, 0.01, 0.05, 0.50, 0.20, 0.90, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
DecisionTreeClassifier(max_depth=5, random_state=1),
DecisionTreeClassifier(max_depth=10, random_state=1),
DecisionTreeClassifier(max_depth=30, random_state=1),
],
}
# Run the random search
randomized_cv_ada = RandomizedSearchCV(
estimator=ada_tuned,
param_distributions=param_grid,
n_jobs=-1,
n_iter=100,
scoring="recall",
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv_ada.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv_ada.best_params_, randomized_cv_ada.best_score_
)
)
Best parameters are {'n_estimators': 140, 'learning_rate': 0.9, 'base_estimator': DecisionTreeClassifier(max_depth=10, random_state=1)} with CV score=0.8975248560962846:
# Set the clf to the best combination of parameters
ada_tuned = randomized_cv_ada.best_estimator_
# Fit the best algorithm to the data.
ada_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
ada_train_perf = model_performance_classification_sklearn(
ada_tuned, X_train_un, y_train_un
)
print("Training performance:")
ada_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Calculating different metrics on validation set
ada_val_perf = model_performance_classification_sklearn(ada_tuned, X_val_t, y_val)
print("Validation performance:")
ada_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.891 | 0.902 | 0.610 | 0.728 |
# training performance comparison
models_train_comp_df = pd.concat(
[logr_train_perf.T, gb_train_perf.T, ada_train_perf.T,], axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression Tuned",
"Gradient Boosting Tuned",
"AdaBoost Tuned",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression Tuned | Gradient Boosting Tuned | AdaBoost Tuned | |
|---|---|---|---|
| Accuracy | 0.759 | 0.964 | 1.000 |
| Recall | 0.892 | 0.964 | 1.000 |
| Precision | 0.704 | 0.963 | 1.000 |
| F1 | 0.787 | 0.964 | 1.000 |
# validation performance comparison
models_val_comp_df = pd.concat(
[logr_val_perf.T, gb_val_perf.T, ada_val_perf.T,], axis=1,
)
models_val_comp_df.columns = [
"Logistic Regression Tuned",
"Gradient Boosting Tuned",
"AdaBoost Tuned",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Logistic Regression Tuned | Gradient Boosting Tuned | AdaBoost Tuned | |
|---|---|---|---|
| Accuracy | 0.680 | 0.901 | 0.891 |
| Recall | 0.887 | 0.877 | 0.902 |
| Precision | 0.321 | 0.640 | 0.610 |
| F1 | 0.471 | 0.740 | 0.728 |
# Calculating different metrics on the test set
gb_test_perf = model_performance_classification_sklearn(gb_tuned, X_test_t, y_test)
print("Test performance:")
gb_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.906 | 0.911 | 0.648 | 0.757 |
confusion_matrix_sklearn(gb_tuned, X_test_t, y_test)
# Feature importances
# build feature names in the same order as the preprocessor output:
# numeric features first, then the one-hot encoded categorical levels
# (pd.get_dummies would not match the ColumnTransformer's column ordering)
ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
feature_names = numerical_features + list(ohe.get_feature_names(categorical_features))
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
A preprocessor was defined in earlier steps. We will use this in our final pipeline.
# Now that we know the best model to proceed with, we no longer need to divide the data into three sets (train, validation, and test)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(8101, 17) (2026, 17)
# Creating new pipeline with best parameters
# train model on undersampled data
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
model = Pipeline(
steps=[
("pre", preprocessor),
(
"GBM",
GradientBoostingClassifier(
random_state=1,
subsample=1,
n_estimators=90,
max_features=0.5,
learning_rate=0.2,
init=AdaBoostClassifier(random_state=1),
),
),
]
)
# Fit the model on training data
model.fit(X_train_un, y_train_un)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer_num',
SimpleImputer(strategy='median')),
('standard '
'scaler',
StandardScaler())]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Total_Revolving_Bal',
'Total_Tra...
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(drop='first'))]),
['Gender', 'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('GBM',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.2, max_features=0.5,
n_estimators=90, random_state=1,
subsample=1))])
# Creating a new pipeline with the best parameters, this time including RandomUnderSampler
# in the pipeline itself, so resampling is applied only during fit and never at prediction time
from imblearn.pipeline import Pipeline as Pipeline_imb
model2 = Pipeline_imb(
steps=[
("pre", preprocessor),
("rus", RandomUnderSampler(random_state=1)),
(
"GBM",
GradientBoostingClassifier(
random_state=1,
subsample=1,
n_estimators=90,
max_features=0.5,
learning_rate=0.2,
init=AdaBoostClassifier(random_state=1),
),
),
]
)
# Fit the model on training data
model2.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer_num',
SimpleImputer(strategy='median')),
('standard '
'scaler',
StandardScaler())]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Total_Revolving_Bal',
'Total_Tra...
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(drop='first'))]),
['Gender', 'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('rus', RandomUnderSampler(random_state=1)),
('GBM',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.2, max_features=0.5,
n_estimators=90, random_state=1,
subsample=1))])
# Calculating different metrics on test set
gb_test_perf = model_performance_classification_sklearn(model2, X_test, y_test)
print("Test performance:")
gb_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.914 | 0.935 | 0.665 | 0.777 |
confusion_matrix_sklearn(model2, X_test, y_test)
Data Background:
Data Preprocessing:
Observations from EDA:
Model Building and Performance:
Models were built to predict whether or not a customer will renounce credit card services. Recall was chosen as the evaluation metric in order to minimize false negatives. Decision tree, logistic regression, bagging, AdaBoost, Gradient Boosting, and XGBoost classifiers were built with default parameters on the original, oversampled, and undersampled data. The top three models (logistic regression, Gradient Boosting, and AdaBoost) were then tuned using random search over their hyperparameters. Gradient Boosting gave the best performance with the least overfitting, reaching a recall above 0.9 on the test set. Based on the chosen Gradient Boosting model, the count of transactions and the total revolving balance are the most significant variables for determining whether a customer will churn. Finally, a pipeline was created that preprocesses the categorical and numerical features, undersamples the training data, fits the model, and makes predictions.